Overview

In this assignment I use the medical cost personal data (insurance.csv).

Column

Graphical Displays Shown

  • Categorical Data
    • Bar Chart (Region & Smokers by Region)
    • Pie Chart (Number of Children)
  • Quantitative Data
    • Histogram (BMI & Charges)
    • Boxplot (BMI by Region & Charges by Number of Children)
    • Scatterplot (Age vs Charges for Smokers & Nonsmokers)

Column

Summary

The insurance data was used to explore distributions and relationships among the variables in the dataset. The first bar chart shows that the southeast region has the most people. The second bar chart shows that there are more nonsmokers than smokers in every region. The BMI histogram shows a roughly symmetric distribution with a median around 30. The histogram of charges is strongly right-skewed, with a median of roughly 9400 and a mean of about 13000. The horizontal side-by-side boxplots show that the BMI distributions are similar across regions, although the southeast region has the highest values. The scatterplots all show the relationship between age and charges: a slight positive linear relationship, meaning that charges tend to increase with age. Coloring the Age vs Charges scatterplot by smoker status reveals three groups. The group with the lowest charges contains only nonsmokers, the middle group contains a mix of smokers and nonsmokers, and the group with the highest charges contains only smokers. The pie chart shows that the largest share of people have zero children, and the percentages shrink as the number of children goes up. Finally, the vertical side-by-side boxplots show that people with zero children actually have the highest charges, while people with 5 children have the lowest charges and the least variation.

Question 1&2-Data

Column

First 500 Observations

Column

Variables

There are 1338 observations and 7 variables in this dataset.

  • age: age of primary beneficiary
  • sex: sex of the insurance contractor (female, male)
  • bmi: body mass index (kg/m^2), an objective index of body weight relative to height that indicates whether weight is relatively high or low; ideally 18.5 to 24.9
  • children: number of children covered by health insurance (number of dependents)
  • smoker: whether the beneficiary smokes (yes, no)
  • region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
  • charges: Individual medical costs billed by health insurance
A glimpse of the data:
Rows: 1,338
Columns: 7
$ age      <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
$ sex      <chr> "female", "male", "male", "male", "male", "female", "female",…
$ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74…
$ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes…
$ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,…

Question 3

Column

  3. Create a bar plot of region. Use a few sentences to summarize your finding based on the plot.

Analysis

There are about the same number of people from the northeast, northwest and southwest regions in this dataset, around 325 each. The southeast region contains the most people, with around 365.

Column

Bar Plot of Region

Question 4

  4. Create a stack bar plot such that region is on the x axis and each bar shows the distribution of smoker in that region. You should make sure that your y axis shows percents.

Bar Plot of Smokers by Region

Question 5

Column

  5. Create a histogram of bmi. Discuss the distribution of the histogram.

Analysis

The BMI distribution is roughly symmetric. The median is around 30 and the range is about 15 to 55. There is one high observation that could be a mild outlier, but otherwise there are no unusual features.

Column

Histogram of BMI

Question 6

Column

  6. Create a histogram of charges. Discuss the distribution of the histogram.

Analysis

This histogram is strongly right-skewed. The median is roughly 9400 (the mean is about 13000) and the range is from around 1100 to 63000. Most of the people in this dataset have charges under 20000, but a few have very high charges, which stretches the right tail.

Column

Histogram of Charges

Question 7

Column

  7. Create a boxplot that shows the distribution of bmi based on the region. Discuss what you find based on the boxplot. (Hint: you need to have x and y variables in mapping)

Analysis

The BMI distributions are similar across regions, with medians ranging from about 28 to 33. The southeast region has the highest median and two large outliers. The southwest and northeast regions also have large outliers, but they are not as extreme as the ones in the southeast region.

Column

Boxplots of BMI by Region

Question 8

Column

  8. Create a scatterplot that shows the relationship between age (independent variable) and charges (dependent variable). Comment on the scatterplot.

Analysis

This scatterplot seems to contain three groups. Each of these groups has a slight positive linear relationship, so overall it seems that charges increase as age increases.

Column

Scatterplot of Age vs Charges

Question 9

Column

  9. You should find that it seems “charges” could be classified into several groups. Let’s create a scatterplot that has age as the independent variable (x) and has smoker as another categorical variable (color), and the response variable is charges. Comment on the scatterplot.

Analysis

This scatterplot shows what the different groups represent. The lowest group contains only nonsmokers, so they have the lowest charges. The middle group is a mix of smokers and nonsmokers. The highest group contains only smokers, so they have the highest charges regardless of age, and within each group charges still increase with age.

Column

Scatterplot of Age vs Charges with Smoker

Question 10

Column

  10. Now, create two data frames by subsetting insurance data as follows.
    smoker<-insurance[insurance$smoker=="yes"]
    nonsmoker<-insurance[insurance$smoker=="no"]

Smoker Data table

Column

Nonsmoker Data table

Question 11

Column

  11. Create a scatterplot that has age as the independent variable (x) and the response variable is charges using the data frame smoker. Then add the smooth line. Comment on the plot. Does it make sense to use the smooth line to summarize the relationship between age of clients and the corresponding charges? Why?

Analysis

This scatterplot is broken into two groups. The smooth line runs through the gap between these groups, so it does not describe either group well and I don’t think it’s needed. The line does make the positive linear relationship clearer, which is one benefit, but that is also something you can tell from the points alone.

Column

Scatterplot Age vs Charges for Smokers

Question 12&13

Column

  12. Repeat Question 11 using the data frame nonsmoker.

Scatterplot Age vs Charges for Nonsmokers

Column

Analysis

This scatterplot is also broken into two groups, but the lower group is much more tightly packed than the upper group. The positive relationship is easy to see in the lower group but harder to see in the upper group, which makes the smooth line useful here.

Question 13

  13. Based on the finding you have on Questions 11 & 12, propose what you might do next if you want to model charges using other variables in this data.

You could create a bar chart to show the charges for each sex or number of children. A scatterplot could be created to compare charges to bmi. A boxplot could show the distribution of charges for each region.

Question 14

Column

  14. Create a pie chart of children. Use a few sentences to summarize your finding based on the plot. (Hint: You need to convert the variable to a categorical variable first)

Analysis

This pie chart makes it clear that most people have few children. The largest share of people have zero children, and the percentages get smaller as the number of children goes up.

Column

Pie Chart for Number of Children

Question 15

Column

  15. Create a boxplot that shows the distribution of charges based on the number of children. Discuss what you find based on the boxplot.

Analysis

This boxplot surprised me because I expected charges to be higher for people with more children. However, the people with zero children actually have the highest charges. The groups with 2, 3, and 4 children are all similar and slightly higher than the group with 1 child. The group with 5 children has the lowest charges and the least variation. There are many large outliers.

Column

Boxplot for Charges by Children

---
title: "Assignment 7"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "lightblue"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
library(plotly)
insurance<- read_csv("~/Desktop/MTH 209/insurance.csv")
```

Overview
===

In this assignment I use the medical cost personal data (insurance.csv).

Column {data-width=500}
---

### Graphical Displays Shown
- Categorical Data
  - Bar Chart (Region & Smokers by Region)
  - Pie Chart (Number of Children)
  
- Quantitative Data
  - Histogram (BMI & Charges)
  - Boxplot (BMI by Region & Charges by Number of Children)
  - Scatterplot (Age vs Charges for Smokers & Nonsmokers)

Column {data-width=500}
---

### Summary
The insurance data was used to explore distributions and relationships among the variables in the dataset. The first bar chart shows that the southeast region has the most people. The second bar chart shows that there are more nonsmokers than smokers in every region. The BMI histogram shows a roughly symmetric distribution with a median around 30. The histogram of charges is strongly right-skewed, with a median of roughly 9400 and a mean of about 13000. The horizontal side-by-side boxplots show that the BMI distributions are similar across regions, although the southeast region has the highest values. The scatterplots all show the relationship between age and charges: a slight positive linear relationship, meaning that charges tend to increase with age. Coloring the Age vs Charges scatterplot by smoker status reveals three groups. The group with the lowest charges contains only nonsmokers, the middle group contains a mix of smokers and nonsmokers, and the group with the highest charges contains only smokers. The pie chart shows that the largest share of people have zero children, and the percentages shrink as the number of children goes up. Finally, the vertical side-by-side boxplots show that people with zero children actually have the highest charges, while people with 5 children have the lowest charges and the least variation.

Question 1&2-Data
===

Column {data-width=450}
---

### <b><font size=4><span Style = "color:blue">First 500 Observations</span></font></b>

```{r show_table}
datatable(insurance[1:500,], rownames=FALSE, colnames= c("age", "sex", "bmi", "children", "smoker", "region", "charges"), options=list(pageLength=20))
```

Column {data-width=550}
---

### <font size= 4><span Style = "color:red">Variables</span></font>

There are 1338 observations and 7 variables in this dataset.

- age: age of primary beneficiary
- sex: sex of the insurance contractor (female, male)
- bmi: body mass index (kg/m^2), an objective index of body weight relative to height that indicates whether weight is relatively high or low; ideally 18.5 to 24.9
- children: number of children covered by health insurance (number of dependents)
- smoker: whether the beneficiary smokes (yes, no)
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
- charges: Individual medical costs billed by health insurance

A glimpse of the data:
```{r}
glimpse(insurance)
```

Question 3
===

Column {data-width=500}
---

3. Create a bar plot of region. Use a few sentences to summarize your finding based on the plot.

### Analysis
There are about the same number of people from the northeast, northwest and southwest regions in this dataset, around 325 each. The southeast region contains the most people, with around 365.

Column {data-width=500}
---

### Bar Plot of Region
```{r bar1}
ggplot(insurance, aes(x=region))+
  geom_bar(fill="#00008B")+
  labs(title="Distribution of Region", x="Region", y="Number of People")+
  theme(text= element_text(size=20))
```

Question 4
===

4. Create a stack bar plot such that region is on the x axis and each bar shows the distribution of smoker in that region. You should make sure that your y axis shows percents.

### Bar Plot of Smokers by Region
```{r bar2}
ggplot(insurance, aes(x=region, fill=smoker))+
  geom_bar(position="fill")+
  scale_y_continuous(breaks=seq(0,1,by=0.2), labels=scales::percent)+
  labs(title="Distribution of Smokers by Region", x="Region", y="Percent of People")+
  theme(text=element_text(size=18))
```
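
The stacked bars convey the smoker share only visually. As a rough cross-check, the same percentages can be tabulated directly. The sketch below is illustrative only: it uses a small made-up data frame (`toy`, not the real insurance data) together with base R's `table()` and `prop.table()`.

```r
# Hypothetical toy data, NOT taken from insurance.csv
toy <- data.frame(
  region = c("northeast", "northeast", "southeast", "southeast",
             "southeast", "southwest", "southwest", "northwest"),
  smoker = c("yes", "no", "yes", "yes", "no", "no", "no", "no")
)

# Cross-tabulate region (rows) by smoker status (columns)
counts <- table(toy$region, toy$smoker)

# margin = 1 normalizes within each row, so every region sums to 100%
pct <- round(prop.table(counts, margin = 1) * 100, 1)
pct
```

On the real data the same two calls would take `insurance$region` and `insurance$smoker` instead of the toy columns.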

Question 5
===

Column {data-width=500}
---

5. Create a histogram of bmi. Discuss the distribution of the histogram.

### Analysis
The BMI distribution is roughly symmetric. The median is around 30 and the range is about 15 to 55. There is one high observation that could be a mild outlier, but otherwise there are no unusual features.

Column {data-width=500}
---

### Histogram of BMI
```{r hist1}
ggplot(insurance, aes(x=bmi))+
  geom_histogram(fill="#76EEC6", bins=30)+
  labs(title="Distribution of BMI", x="BMI", y="Frequency")+
  theme(text=element_text(size=20))
```

Question 6
===

Column {data-width=500}
---

6. Create a histogram of charges. Discuss the distribution of the histogram.

### Analysis
This histogram is strongly right-skewed. The median is roughly 9400 (the mean is about 13000) and the range is from around 1100 to 63000. Most of the people in this dataset have charges under 20000, but a few have very high charges, which stretches the right tail.
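
For a right-skewed variable the median and the mean can differ noticeably, so it helps to compute both rather than read them off the plot. A minimal sketch, assuming a small made-up vector `charges_demo` (hypothetical values, not the real data); in the dashboard the same calls would be applied to `insurance$charges`.

```r
# Hypothetical toy charges, NOT taken from insurance.csv
charges_demo <- c(1100, 2000, 4500, 7000, 9000, 12000, 20000, 40000, 63000)

charges_median <- median(charges_demo)  # middle value of the sorted data
charges_mean   <- mean(charges_demo)    # pulled upward by the large values
charges_range  <- range(charges_demo)   # c(minimum, maximum)

# A mean well above the median is a numeric sign of right skew
charges_mean > charges_median
```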

Column {data-width=500}
---

### Histogram of Charges
```{r hist2}
ggplot(insurance, aes(x=charges))+
  geom_histogram(fill="#9A32CD", bins=30)+
  labs(title="Distribution of Charges", x="Charges", y="Frequency")+
  theme(text=element_text(size=20))
```

Question 7
===

Column {data-width=500}
---

7. Create a boxplot that shows the distribution of bmi based on the region. Discuss what you find based on the boxplot. (Hint: you need to have x and y variables in mapping)

### Analysis
The BMI distributions are similar across regions, with medians ranging from about 28 to 33. The southeast region has the highest median and two large outliers. The southwest and northeast regions also have large outliers, but they are not as extreme as the ones in the southeast region.

Column {data-width=500}
---

### Boxplots of BMI by Region
```{r box1}
ggplot(insurance, aes(x=bmi, y=region))+
  geom_boxplot(fill="#FFF0F5")+
  labs(title="Distribution of BMI by Region", x="BMI", y="Region")+
  theme(text=element_text(size=20))
```

Question 8
===

Column {data-width=500}
---

8. Create a scatterplot that shows the relationship between age (independent variable) and charges (dependent variable). Comment on the scatterplot.

### Analysis
This scatterplot seems to contain three groups. Each of these groups has a slight positive linear relationship, so overall it seems that charges increase as age increases.

Column {data-width=500}
---

### Scatterplot of Age vs Charges
```{r scatter1}
ggplot(insurance, aes(x=age, y=charges))+
  geom_point(color="#EE6363")+
  labs(title="Age vs Charges", x="Age", y="Charges")+
  theme(text=element_text(size=20))
```

Question 9
===

Column {data-width=500}
---

9. You should find that it seems "charges" could be classified into several groups. Let's create a scatterplot that has age as the independent variable (x) and has smoker as another categorical variable (color), and the response variable is charges. Comment on the scatterplot.

### Analysis
This scatterplot shows what the different groups represent. The lowest group contains only nonsmokers, so they have the lowest charges. The middle group is a mix of smokers and nonsmokers. The highest group contains only smokers, so they have the highest charges regardless of age, and within each group charges still increase with age.

Column {data-width=500}
---

### Scatterplot of Age vs Charges with Smoker
```{r scatter2}
ggplot(insurance, aes(x=age, y=charges, color=smoker))+
  geom_point()+
  labs(title="Age vs Charges", x="Age", y="Charges")+
  theme(text=element_text(size=20))
```

Question 10
===

Column {data-width=500}
---

10. Now, create two data frames by subsetting insurance data as follows.  
smoker<-insurance[insurance\$smoker=="yes"]  
nonsmoker<-insurance[insurance\$smoker=="no"]
```{r}
smoker <- insurance[insurance$smoker=="yes", ]
nonsmoker <- insurance[insurance$smoker=="no", ]
```
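
The trailing comma in `insurance[condition, ]` is what makes the logical condition select rows; with a single index and no comma, a data frame treats the index as a column selector. The sketch below illustrates the row form on a made-up data frame (`toy`, `smoker_demo` and `nonsmoker_demo` are hypothetical names, not objects from the assignment).

```r
# Hypothetical toy data frame, NOT the real insurance data
toy <- data.frame(
  age     = c(19, 45, 33, 60),
  smoker  = c("yes", "no", "yes", "no"),
  charges = c(16000, 8000, 21000, 13000)
)

# The condition before the comma picks rows; leaving the part after
# the comma empty keeps every column
smoker_demo    <- toy[toy$smoker == "yes", ]
nonsmoker_demo <- toy[toy$smoker == "no", ]

# The two pieces should together account for every original row
nrow(smoker_demo) + nrow(nonsmoker_demo) == nrow(toy)
```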

### Smoker Data table
```{r show_table2}
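# head() returns at most 500 rows; smoker[1:500, ] would index past the last row when there are fewer than 500 smokers
datatable(head(smoker, 500), rownames=FALSE, colnames= c("age", "sex", "bmi", "children", "smoker", "region", "charges"), options=list(pageLength=20))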
```

Column {data-width=500}
---

### Nonsmoker Data table
```{r show_table3}
datatable(nonsmoker[1:500,], rownames=FALSE, colnames= c("age", "sex", "bmi", "children", "smoker", "region", "charges"), options=list(pageLength=20))
```

Question 11
=== 

Column {data-width=500}
---

11. Create a scatterplot that has age as the independent variable (x) and the response variable is charges using the data frame smoker. Then add the smooth line. Comment on the plot. Does it make sense to use the smooth line to summarize the relationship between age of clients and the corresponding charges? Why?

### Analysis
This scatterplot is broken into two groups. The smooth line runs through the gap between these groups, so it does not describe either group well and I don't think it's needed. The line does make the positive linear relationship clearer, which is one benefit, but that is also something you can tell from the points alone.

Column {data-width=500}
---

### Scatterplot Age vs Charges for Smokers
```{r scatter3}
ggplot(smoker, aes(x=age, y=charges))+
  geom_point(color="#90EE90")+
  geom_smooth(color="#EE6AA7")+
  labs(title="Age vs Charges for Smokers", x="Age", y="Charges")+
  theme(text=element_text(size=20))
```

Question 12&13
=== 

Column {data-width=500}
---

12. Repeat Question 11 using the data frame nonsmoker.

### Scatterplot Age vs Charges for Nonsmokers
```{r scatter4}
ggplot(nonsmoker, aes(x=age, y=charges))+
  geom_point(color="#607B8B")+
  geom_smooth(color="#EEAEEE")+
  labs(title="Age vs Charges for Nonsmokers", x="Age", y="Charges")+
  theme(text=element_text(size=20))
```

Column {data-width=500}
---

### Analysis
This scatterplot is also broken into two groups, but the lower group is much more tightly packed than the upper group. The positive relationship is easy to see in the lower group but harder to see in the upper group, which makes the smooth line useful here.

### Question 13
13. Based on the finding you have on Questions 11 & 12, propose what you might do next if you want to model charges using other variables in this data.

You could create a bar chart to show the charges for each sex or number of children. A scatterplot could be created to compare charges to bmi. A boxplot could show the distribution of charges for each region.
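
Beyond more plots, one possible next step is to fit a simple model. This is an assumption on my part, since the assignment does not prescribe a model form: the sketch below fits a linear model with age and a smoker indicator using base R's `lm()`. The data frame `toy` is made up from an exact linear rule purely for illustration, so its coefficients say nothing about the real insurance data.

```r
# Hypothetical toy data built from an exact linear rule, NOT real insurance records
toy <- data.frame(
  age    = c(20, 30, 40, 50, 60, 25, 35, 45),
  smoker = c("no", "no", "no", "no", "no", "yes", "yes", "yes")
)
toy$charges <- 1000 + 200 * toy$age + 20000 * (toy$smoker == "yes")

# Fit charges on age plus a smoker indicator (assumed model form)
fit <- lm(charges ~ age + smoker, data = toy)

# "smokeryes" is the estimated shift in charges for smokers at a given age
round(coef(fit), 2)
```

On the real data the call would be `lm(charges ~ age + smoker, data = insurance)`, and other variables such as bmi could be added to the formula in the same way.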

Question 14
=== 

Column {data-width=500}
---

14. Create a pie chart of children. Use a few sentences to summarize your finding based on the plot. (Hint: You need to convert the variable to a categorical variable first)

### Analysis
This pie chart makes it clear that most people have few children. The largest share of people have zero children, and the percentages get smaller as the number of children goes up.

Column {data-width=500}
---

### Pie Chart for Number of Children
```{r pie}
# Convert children to a categorical variable before counting so the fill scale is discrete
children_count <- insurance %>%
  mutate(children = as.character(children)) %>%
  count(children)
children_count$percent<- round(children_count$n/sum(children_count$n)*100,2)
pie<- ggplot(children_count, aes(x="", y=percent, fill=children))+
      geom_bar(stat="identity", width=1, color="white")+
      labs(title="Distribution of Number of Children")

pie<- pie + coord_polar("y", start=0)+
      geom_text(aes(label=paste0(percent,"%")),
                fontface="bold", color="white",
                position=position_stack(vjust=0.5))

pie<- pie + theme_void()+ theme(text= element_text(size=18))
pie
```

Question 15
=== 

Column {data-width=500}
---

15. Create a boxplot that shows the distribution of charges based on the number of children. Discuss what you find based on the boxplot.

### Analysis
This boxplot surprised me because I expected charges to be higher for people with more children. However, the people with zero children actually have the highest charges. The groups with 2, 3, and 4 children are all similar and slightly higher than the group with 1 child. The group with 5 children has the lowest charges and the least variation. There are many large outliers.
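
Which group looks "highest" in a boxplot can depend on whether the medians or the extreme values are being compared, so a per-group numeric summary is a useful companion. A minimal sketch with base R's `tapply()`, assuming a made-up data frame `toy` (hypothetical values, not the real data):

```r
# Hypothetical toy data, NOT taken from insurance.csv
toy <- data.frame(
  children = c(0, 0, 0, 1, 1, 2, 2, 2),
  charges  = c(5000, 9000, 30000, 4000, 8000, 6000, 7000, 11000)
)

# Median and interquartile range of charges within each children group
med_by_children <- tapply(toy$charges, toy$children, median)
iqr_by_children <- tapply(toy$charges, toy$children, IQR)

med_by_children
```

On the real data the same calls would use `insurance$charges` and `insurance$children`.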

Column {data-width=500}
---

### Boxplot for Charges by Children
```{r boxplot2}
ggplot(insurance, aes(x=children, y=charges, group=children))+
  geom_boxplot(fill="#C6E2FF")+
  labs(title="Charges Based on Number of Children", x="Number of Children", y="Charges")+
  theme(text=element_text(size=16))
```